Principal Component Analysis (PCA)

Mezmur Wossenu Edo

Outline

  • Motivation

  • Intuition

  • Theory

  • PCA in action

Motivation

  • Computational Efficiency

  • Feature Extraction

  • Visualization

  • Curse of dimensionality

Curse of dimensionality

As the number of features grows, a fixed number of samples becomes increasingly sparse in the feature space, so more data is needed to generalize well; reducing dimensionality mitigates this.

Intuition

Find orthogonal directions of maximum variance in the data. Directions along which the data has sufficiently low variance can be removed with little loss of information.

We can then project the data onto the direction \vec{p}_1.

Theory

Let x be a datapoint with features f_1, f_2, f_3, …, f_n,

x = \begin{pmatrix} f_1\\ f_2\\ f_3\\ .\\ .\\ .\\ f_n \end{pmatrix}

The projection of x onto p is then,

x^{T} \frac{p}{||p||}

Hence, the projection of all datapoints onto the principal component direction, p, can be written as,

\begin{pmatrix} x_1^{T} \frac{p}{||p||}\\ x_2^{T} \frac{p}{||p||}\\ x_3^{T} \frac{p}{||p||}\\ .\\ .\\ .\\ x_m^{T} \frac{p}{||p||} \end{pmatrix} = X\frac{p}{||p||}

where:

  • X is the m \times n design matrix whose rows are the m datapoints, x_i^T.
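As a quick sanity check, the matrix product X\frac{p}{||p||} computes every datapoint's projected coordinate at once. A minimal NumPy sketch (the matrix and direction below are illustrative, not from the text):

```python
import numpy as np

# Illustrative design matrix X: m = 4 datapoints (rows), n = 2 features.
X = np.array([[1.0, 2.0],
              [3.0, 4.0],
              [5.0, 6.0],
              [7.0, 8.0]])

p = np.array([1.0, 1.0])  # a candidate direction (not yet unit length)

# One scalar coordinate per datapoint: X p / ||p||.
proj = X @ (p / np.linalg.norm(p))
```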

The Optimization Problem

Let \bar{x} be the sample mean vector,

\bar{x} = \frac{1}{m}\sum_{i=1}^{m}x^{(i)}

the sample covariance matrix is then given by,

S = \frac{1}{m} X^TX - \bar{x}\bar{x}^T

where:

  • S_{ij} is the covariance of feature i and feature j.
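The formula for S can be verified against NumPy's covariance routine; with bias=True, np.cov uses the same \frac{1}{m} normalization as above (the random data is just for illustration):

```python
import numpy as np

rng = np.random.default_rng(0)
m, n = 500, 3
X = rng.normal(size=(m, n))  # illustrative random design matrix

x_bar = X.mean(axis=0)                       # sample mean vector
S = (X.T @ X) / m - np.outer(x_bar, x_bar)   # S = (1/m) X^T X - x_bar x_bar^T
```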

For a sample mean of the projected data, \bar{a},

\bar{a} = \frac{1}{m}\sum_{i=1}^{m}x^{(i)T}p = \bar{x}^Tp

the sample variance of the projected data can be written as,

\sigma^{2}= \frac{1}{m}\sum_{i=1}^{m}(x^{(i)T}p)^2 - \bar{a}^{2} = p^{T}Sp
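The identity \sigma^{2} = p^{T}Sp can be checked numerically with an arbitrary unit vector p (the data and direction below are illustrative):

```python
import numpy as np

rng = np.random.default_rng(1)
X = rng.normal(size=(1000, 2)) @ np.array([[2.0, 0.0], [1.0, 0.5]])

m = X.shape[0]
x_bar = X.mean(axis=0)
S = (X.T @ X) / m - np.outer(x_bar, x_bar)

p = np.array([0.6, 0.8])   # a unit vector (0.36 + 0.64 = 1)
a = X @ p                  # projected coordinates x^{(i)T} p
var_direct = np.mean(a**2) - np.mean(a)**2   # (1/m) sum a_i^2 - a_bar^2
```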

Then our optimization problem is,

\max_p \; p^{T}Sp \quad \text{s.t.} \quad ||p||=1

which has the solution,

Sp = \lambda p

that is, p must be an eigenvector of S, and the projected variance p^{T}Sp = \lambda is maximized by the eigenvector with the largest eigenvalue.
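This eigenvalue problem can be solved directly with NumPy; a sketch on illustrative data where one feature carries most of the variance:

```python
import numpy as np

rng = np.random.default_rng(2)
# Feature 0 has much larger variance than feature 1.
X = rng.normal(size=(300, 2)) * np.array([3.0, 0.5])

m = X.shape[0]
Xc = X - X.mean(axis=0)
S = (Xc.T @ Xc) / m

# eigh returns the eigenvalues of the symmetric matrix S in ascending order.
eigvals, eigvecs = np.linalg.eigh(S)
p1 = eigvecs[:, -1]  # principal direction: eigenvector of the largest eigenvalue
```

Here p1 points (up to sign) almost exactly along the high-variance feature axis, as expected.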

Sklearn Implementation

Computation can be done using the singular value decomposition of X.

X = U \Sigma V^T

If the data is mean-centered (the default option in sklearn), the sample covariance matrix is given by,

S = \frac{1}{m} X^TX = \frac{1}{m} V\Sigma U^T U \Sigma V^T = V\frac{1}{m}\Sigma^2V^T

which is the eigenvalue decomposition of S, with eigenvectors as the columns of V and corresponding eigenvalues as diagonal entries of \frac{1}{m}\Sigma^2.
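This relationship between the SVD and the eigendecomposition of S can be checked numerically (random centered data below is illustrative):

```python
import numpy as np

rng = np.random.default_rng(3)
X = rng.normal(size=(200, 4))
Xc = X - X.mean(axis=0)   # mean-center, as sklearn does by default
m = Xc.shape[0]

# SVD of the centered design matrix; rows of Vt are the right singular vectors.
U, sigma, Vt = np.linalg.svd(Xc, full_matrices=False)

# Columns of V are eigenvectors of S with eigenvalues sigma_j^2 / m.
S = (Xc.T @ Xc) / m
eigvals = sigma**2 / m
```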

The variance explained by p_j is \lambda_{j}. Hence, the total variance is,

\sum_{j=1}^{n} \lambda_j = \mathrm{trace}(S)

and the total variance explained by the first k principal components is then,

\frac{\sum_{j=1}^{k} \lambda_j}{\mathrm{trace}(S)}
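This ratio is exactly what sklearn exposes as explained_variance_ratio_. A sketch on illustrative data (note sklearn normalizes the eigenvalues by m - 1 rather than m, but the ratio is unaffected):

```python
import numpy as np
from sklearn.decomposition import PCA

rng = np.random.default_rng(4)
X = rng.normal(size=(150, 3)) * np.array([5.0, 2.0, 1.0])

pca = PCA().fit(X)  # keep all components

# explained_variance_ holds the eigenvalues lambda_j; dividing by their sum
# (= trace of the covariance matrix) gives the explained-variance ratios.
lam = pca.explained_variance_
ratio_manual = lam / lam.sum()
```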

PCA in Action

Code
import numpy as np
import matplotlib.pyplot as plt
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

# Generate simulated data
np.random.seed(42)
n_samples = 200

feature1 = np.random.rand(n_samples)
feature2 = feature1 * 1000  # same information, 1000x larger scale

X = np.column_stack((feature1, feature2))

# PCA without scaling
pca_no_scale = PCA(n_components=1)
X_pca_no_scale = pca_no_scale.fit_transform(X)

scaler = StandardScaler()
X_scaled = scaler.fit_transform(X)
# PCA with scaling
pca_scaled = PCA(n_components=1)
X_pca_scaled = pca_scaled.fit_transform(X_scaled)

print("First principal component (no scaling):", pca_no_scale.components_)
print("First principal component (scaling):", pca_scaled.components_)

Output:
First principal component (no scaling): [[0.001     0.9999995]]
First principal component (scaling): [[0.70710678 0.70710678]]

[Figure: PCA projection without scaling]

[Figure: PCA projection with scaling]

Summary

  • PCA is a dimensionality reduction technique that projects data onto directions which explain the most variance in the data.

  • The principal component directions are eigenvectors of the sample covariance matrix, with corresponding eigenvalues representing the variance explained.

  • For a proper implementation of PCA, the data must be mean-centered (the sklearn default) and, when features are on different scales, standardized.